Demand Forecasting for Yield 🌾


Description

A new fast food chain has seen rapid expansion over the past couple of years. They are now trying to optimize their supply chain to ensure that there are no shortages of ingredients. For this, they’ve tasked their data science team with building a model that can predict the output of each food processing farm over the next few years.

These predictions could further increase the efficiency of their current supply chain management systems. In this competition, you are expected to build one or more machine learning models that predict the output of the food processing farms for the next year.


About Data: There are 5 datasets along with a sample submission file provided to you in this competition. The datasets are named as follows:

Installing Necessary Libraries

Importing Libraries

Import Data

Insights:

  1. The shape of the data after merging is (1290364, 16).
  2. Several columns contain many null values. Imputing a column is unreliable when more than 20% of its values are null, so it is better to remove those columns: [operating_commencing_year, num plants, cloudiness].
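
The 20% rule above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's original cell: the data, the `yield` and `temp_obs` column names, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data (values are made up).
df = pd.DataFrame({
    "yield": np.arange(10, dtype=float),
    # 1 of 10 values missing -> 10% null, below the threshold
    "temp_obs": [20.0, np.nan, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0],
    # 6 of 10 values missing -> 60% null, above the threshold
    "cloudiness": [np.nan, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 2.0, np.nan, 4.0],
})

# Percentage of null values per column.
null_pct = df.isna().mean() * 100

# Columns above the threshold are dropped instead of imputed.
threshold = 20
cols_to_drop = null_pct[null_pct > threshold].index.tolist()
df_reduced = df.drop(columns=cols_to_drop)

print(null_pct.round(1))
print("Dropped:", cols_to_drop)
```

Columns that stay below the threshold (here `temp_obs`) are kept and left for the imputation step.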

Preprocessing

Dropping ingredient_type and farming_company. ingredient_type is constant (zero variance), so it carries no information. farming_company doesn't add value for modelling; it is useful only for insights. In the hierarchy, farming companies have overlapping farms, so the data is messy to interpret per company. Hence both columns are removed.
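
A quick way to check both claims before dropping is sketched below. The toy data, the `farm_id` and `yield` column names, and the category values are assumptions for illustration; only `ingredient_type` and `farming_company` come from the text above.

```python
import pandas as pd

# Toy frame (made-up values) with one constant column and one farm
# that appears under two different companies.
df = pd.DataFrame({
    "farm_id": ["f1", "f1", "f2", "f2"],
    "ingredient_type": ["ing_w"] * 4,
    "farming_company": ["c1", "c2", "c1", "c1"],
    "yield": [10.0, 12.0, 7.0, 8.0],
})

# A column with a single distinct value has zero variance.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Farms listed under more than one company show the overlap.
companies_per_farm = df.groupby("farm_id")["farming_company"].nunique()
overlapping_farms = companies_per_farm[companies_per_farm > 1].index.tolist()

# Drop both columns before modelling.
df_model = df.drop(columns=["ingredient_type", "farming_company"])

print("Constant columns:", constant_cols)
print("Farms under several companies:", overlapping_farms)
```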

Aggregating attribute-wise:

Based on the record layout, the expected number of rows is 24 × 366 × 145 = 1,273,680, but the actual size differs because:

  1. For the same timestamp, there are multiple records
  2. There are missing dates
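
The row-count check and the handling of duplicate timestamps might look like the sketch below. It assumes the three factors are hours per day, days in the year and number of farms, which the text doesn't spell out. The `farm_id`, `timestamp` and `yield` column names and the choice of `mean` as the aggregation are also assumptions.

```python
import pandas as pd

# Expected number of rows from the arithmetic above.
expected_rows = 24 * 366 * 145

# Toy data (made-up values) with one duplicated timestamp for farm f1.
df = pd.DataFrame({
    "farm_id": ["f1", "f1", "f1", "f2"],
    "timestamp": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 00:00",  # duplicate
        "2016-01-01 02:00",                      # 01:00 is missing
        "2016-01-01 00:00",
    ]),
    "yield": [10.0, 14.0, 9.0, 5.0],
})

# Count rows that repeat an earlier (farm, timestamp) pair.
n_duplicates = int(df.duplicated(subset=["farm_id", "timestamp"]).sum())

# Collapse duplicates to one row per farm and timestamp.
agg = df.groupby(["farm_id", "timestamp"], as_index=False)["yield"].mean()

print("Expected rows:", expected_rows)
print("Duplicate rows:", n_duplicates)
print(agg)
```

Whether `mean`, `sum` or another aggregation is right depends on what each attribute measures, so that choice should be made per column.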

Imputing Missing Dates
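
One possible way to add the missing timestamps is to reindex each farm onto a complete hourly grid, as in the sketch below. The hourly frequency and the `farm_id`, `timestamp` and `yield` column names are assumptions. The newly created rows hold NaN and still need to be imputed afterwards.

```python
import pandas as pd

# Toy data (made-up values): farm f1 has no record for 01:00.
df = pd.DataFrame({
    "farm_id": ["f1", "f1", "f2", "f2", "f2"],
    "timestamp": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 02:00",
        "2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00",
    ]),
    "yield": [10.0, 9.0, 5.0, 6.0, 7.0],
})

# Complete hourly range covering the observed period.
full_range = pd.date_range(
    df["timestamp"].min(), df["timestamp"].max(), freq=pd.Timedelta(hours=1)
)

# Every (farm, timestamp) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [df["farm_id"].unique(), full_range], names=["farm_id", "timestamp"]
)

# Reindexing inserts the missing rows with NaN values.
# The (farm, timestamp) pairs must be unique at this point.
complete = (
    df.set_index(["farm_id", "timestamp"])
      .reindex(full_index)
      .reset_index()
)

print(complete)
```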

Now the data is ready to rock! But the missing values still need to be imputed.

Imputing Null values

Iterative Imputer:

Iterative Imputer is a machine learning technique used for imputing missing values in a dataset. It works by filling in the missing values with predicted values that are generated by iterating over the features of the dataset.

The iterative imputer algorithm uses a regression model to predict the missing values of each feature from the other features, and then repeats this process until the imputed values converge to a stable solution or a maximum number of iterations is reached. The algorithm is based on the idea that the missing values in a dataset are not completely random, but rather are correlated with other features in the dataset.
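
A minimal sketch of this idea with scikit-learn's `IterativeImputer` follows. The synthetic two-feature data and the parameter values are assumptions chosen for illustration, not the notebook's actual settings. Note that the estimator is still experimental in scikit-learn, so it has to be enabled explicitly.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the second feature is strongly correlated with the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Hide 40 values of the second feature.
X_missing = X.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 1] = np.nan

# Each feature with missing values is regressed on the other features
# (BayesianRidge by default), and the rounds repeat up to max_iter times.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Compare the imputed values with the hidden true values.
mae = np.abs(X_imputed[missing_rows, 1] - X[missing_rows, 1]).mean()
print("Mean absolute imputation error:", round(float(mae), 3))
```

Because the imputer relies on correlations between features, it helps most when the columns with gaps are well explained by the remaining columns, as in this toy example.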

Renaming the columns so that the model can interpret the variables easily

Plotting Yield vs Time (Univariate Analysis)

Insights:

Plotting Exogenous Variables (Multivariate)

Train - Test Split

Model Building

Features

Evaluate results

Save the Model